Abstract

In this paper we study the problem of finding the ap- 
proximate nearest neighbor of a query point in the high 
dimensional space, focusing on the Euclidean space. The 
earlier approaches use locality-preserving hash functions 
(that tend to map nearby points to the same value) to 
construct several hash tables to ensure that the query 
point hashes to the same bucket as its nearest neighbor
in at least one table. Our approach is different: we use
one (or a few) hash tables and hash several randomly chosen
points in the neighborhood of the query point, showing that
at least one of them will hash to the bucket containing its
nearest neighbor. We show that the number of randomly chosen
points required in the neighborhood of the query point $q$
depends on the entropy of the hash value $h(p)$ of a random
point $p$ at the same distance from $q$ as its nearest
neighbor, given $q$ and the locality preserving hash function
$h$ chosen randomly from the hash family. Precisely, we show
that if the entropy $I(h(p) \mid q, h) = M$ and $g$ is a bound
on the probability that two far-off points will hash to the
same bucket, then we can find the approximate nearest neighbor
in $O(n^\rho)$ time and near linear $O(n)$ space, where
$\rho = M/\log(1/g)$. Alternatively we can build a data
structure of size $O(n^{1/(1-\rho)})$ to answer queries in
$O(d)$ time. By applying this analysis to the locality
preserving hash functions in [17, 21, …] and adjusting the
parameters we show that the $c$-approximate nearest neighbor
can be computed in time $O(n^\rho)$ and near linear space,
where $\rho \approx 2.06/c$ as $c$ becomes large.

1 Introduction 

In this paper we study the problem of finding the near- 
est neighbor of a query point in the high dimensional 
Euclidean space: given a database of $n$ points in a
$d$-dimensional space, find the nearest neighbor of a query
point. This fundamental problem arises in several
applications including data mining, information retrieval, and
image search, where distinctive features of the objects are
represented as points in $\mathbb{R}^d$ [23, …].
While the exact problem seems to suffer from the "curse
of dimensionality" (that is, either the query time or the
space required is exponential in $d$ [23]), many efficient
techniques have been devised for finding an approximate
solution whose distance from the query point is at most
$1 + \epsilon$ times its distance from the nearest neighbor
[…]. The best known algorithm for finding a
$(1 + \epsilon)$-approximate nearest neighbor of a query
point runs in time $O(d \log n)$ using a data structure of
size $(nd)^{O(1/\epsilon^2)}$. Since the exponent of the space
requirement grows as $1/\epsilon^2$, in practice this may be
prohibitively expensive for small $\epsilon$. Indeed, since
even a space complexity of $(nd)$ may be too large, perhaps it
makes more
sense to interpret these results as efficient, practical al- 
gorithms for c-approximate nearest neighbor where c is 
a constant greater than one. Also, this is meaningful in 
practice as typically when we are given a query point we 
are really interested in finding a neighbor that is much 
closer to the query point than the other points - the 
query point (say an image) really represents the 'same 
object' as the nearest neighbor we expect it to 'match' 
except that they may differ a little due to noise, or
inherent errors in how well points represent their objects,
but it is expected to be quite far from the other points in 
the database which basically represent 'different objects' 
from the query point. 

For these parameters, Indyk and Motwani [17] provide
an algorithm for finding the $c$-approximate nearest neighbor
in time $O(d + n^{1/c})$ using an index of size $O(n^{1+1/c})$
(while their paper states a query time of $O(d n^{1/c})$, if $d$
is large this can easily be converted to $O(d + n^{1/c})$ by
dimension reduction); with a data structure of near linear
size, for the Hamming space, the algorithms in […]
require a query time of $n^{O(\log c / c)}$. To put this in
perspective, finding a 2-approximate nearest neighbor requires
time $O(\sqrt{n})$ and an index of size $O(n\sqrt{n})$. The
exponent was improved slightly in […] for $c$ in $[1, 10]$:
instead of $1/c$ it was $\beta/c$ where $\beta$ is a constant
slightly less than 1 for $c < 10$; for example when $c = 2$
they can reduce the exponent to approximately 0.42, implying a
running time of $n^{0.42}$ and an index of size $n^{1.42}$.
Their simulation results indicate that while locality
sensitive hashing gives faster query time than other data
structures based on the kd-tree, it also comes at the expense
of using a lot more space. They work with the following
decision version of the $c$-approximate nearest neighbor
problem: given a query point and a parameter $r$ for the
distance to its nearest neighbor, find any neighbor of the
query point that is at distance at most $cr$. It is well known
that the reduction to the decision version adds only a
logarithmic factor in the time and space complexity [17, 12].

In their formulation, they use a locality sensitive hash
function that maps points in the space to a discrete space
where nearby points are likely to get hashed to the same
value and far off points are likely to get hashed to different
values. Precisely, given a parameter $m$ that denotes a
lower bound on the probability that two points at most
$r$ apart hash to the same bucket and $g$ an upper bound
on the probability that two points more than $cr$ apart
hash to the same bucket, they show that such a hash
function can find a $c$-approximate nearest neighbor in
$O(d + n^\rho)$ time using a data structure of size
$O(n^{1+\rho})$ where $\rho = \log(1/m)/\log(1/g)$.

Their approach is to construct several hash tables to
ensure that the query point hashes to the same bucket as
its nearest neighbor in at least one table. Our approach
is different: we use one (or a few) hash tables and hash
several randomly chosen points in the neighborhood of
the query point, showing that at least one of them will
hash to the bucket containing its nearest neighbor. We
show that the number of randomly chosen points required in
the neighborhood of the query point $q$ depends on
the entropy of the hash value $h(p)$ of a random point $p$
at distance $r$ from $q$, given $q$ and the locality preserving
hash function $h$ chosen randomly from the hash family.
Precisely, we show that if the entropy $I(h(p) \mid q, h) = M$
then we can find the approximate nearest neighbor in
$O(d + n^\rho)$ time and near linear space $O(n)$ where
$\rho = M/\log(1/g)$. Here $I(h(p) \mid q, h)$ denotes the
entropy of $h(p)$ for a random point $p$ at distance $r$ from
$q$ given the query point $q$ and the specific hash function
$h$ from the hash family in use. Alternatively we can build a
data structure of size $O(n^{1/(1-\rho)})$ to answer queries in
$O(d)$ time. By applying this analysis to the locality
preserving hash functions in […] and adjusting the parameters
we show that the $c$-approximate nearest neighbor can be
computed in time $n^\rho$ and near linear space where
$\rho \approx 2.06/c$ as $c$ becomes large. For $c = 2$, $\rho$
turns out to be about 0.69, giving a query time of about
$n^{0.69}$. Note that $I(h(p) \mid q, h)$ can be much lower than
$I(h(p) \mid h(q))$; the latter corresponds to guessing $h(p)$
from $h(q)$ and can lead to much slower algorithms. For
example in the Euclidean case an algorithm based on
the latter entropy would give a much higher value of $\rho$ of
about $O(\log c/c)$, but using both $h$ and $q$ in conjunction
instead of just $h(q)$ gives us the improved results. We
also show that if the points are chosen randomly from a
spherical Gaussian distribution (section 4) the value of $\rho$
can be improved to about $1.47/c$.

A major advantage of such a small index of size 0(n) is 
that the entire index could possibly fit in main memory 
making all memory accesses RAM accesses instead of the 
much slower disk accesses. This increases the number of
possible accesses in the same query time by a factor in
the thousands. If there is a unique $c$-approximate
nearest neighbor - which may be typical in practice - 
we argue that only 2 log n bits of storage are required in 
the index for each point for large enough values of c. So 
even with a million entries, we need only an index of size 
5MB which is a trivial amount of RAM space in today's 
PCs. 

Application of our techniques to the $L_1$ norm does
not result in any improvement over the previous results:
with linear space we get a value of $\rho$ of about
$\log(c)/c$, matching the bounds in [21, 17].

2 Results 

• $B(p,r)$: Let $B(p,r)$ denote the sphere of radius $r$
centered at a point $p$ in $\mathbb{R}^d$; that is, the set of
points at distance $r$ from $p$.

• $I(X)$: For a discrete random variable $X$, let
$I(X)$ denote its information-entropy. For example,
if $X$ takes $N$ possible values with probabilities
$w_1, w_2, \ldots, w_N$ then
$I(X) = I(w_1, w_2, \ldots, w_N) = \sum_i I(w_i) = \sum_i -w_i \log w_i$.
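The entropy definition above can be made concrete with a few lines of Python. This is a minimal sketch, assuming logarithms are taken base 2 (which matches the $2^I$ sample counts used later); the function name `entropy` and the two example distributions are illustrative and not part of the paper.

```python
import math

def entropy(probs):
    """Entropy I(w_1, ..., w_N) = sum_i -w_i * log2(w_i), in bits.

    Zero-probability outcomes contribute nothing (0 * log 0 is taken as 0).
    """
    return sum(-w * math.log2(w) for w in probs if w > 0)

# A uniform distribution over 8 values has entropy log2(8) = 3 bits.
uniform8 = [1 / 8] * 8
# A biased coin, I(0.25, 0.75).
coin = [0.25, 0.75]

print(entropy(uniform8), entropy(coin))
```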

We will work with the following decision version of the 
c-approximate nearest neighbor problem: given a query 
point and a parameter r indicating the distance to its 
nearest neighbor, find any neighbor of the query point 
that is at distance at most cr. We will refer to this
decision version as the (r, cr)-nearest neighbor problem 
and a solution to this as a (r, cr)-nearest neighbor. It 
is well known that the reduction to the decision version 
adds only a logarithmic factor in the time and space 
complexity [17, 12].



We use locality preserving hash functions to map 
database points into a hash table; a locality preserving 
hash function is a random function from a hash family 
that is likely to hash nearby points to the same value and 
far off points to different values in a discrete space. To 
find the approximate nearest neighbor of a query point, 
we hash several randomly chosen points in the vicinity of 
the query point and show that the approximate nearest 
neighbor is likely to be present in one of these buckets. 

We assume that the locality preserving hash function 
has the following properties. Let M denote the entropy 
$I(h(p) \mid q, h)$ where $p$ is a random point in $B(q,r)$. Here
$I(h(p) \mid q, h)$ denotes the entropy of $h(p)$ given the query
point and the specific hash function from the hash family 
in use. Let g denote an upper bound on the probabil- 
ity that two points that are at least distance cr apart 
will hash to the same bucket. Note that after a random 
rotation and a random shift of the origin the nearest 
neighbor of q appears like a random point on B(q,r). 
Our algorithm is simple: 

Construction of hash table: Pick $k = \log n/\log(1/g)$
random hash functions $h_1, h_2, \ldots, h_k$. For
each point $p$ in the database compute (after random
rotations and shifts for each hash function)
$H(p) = (h_1(p), h_2(p), \ldots, h_k(p))$. For each point $p$,
store it in a table at location $H(p)$; use hashing to store
only the nonempty locations. Use polylog $n$ such randomly
constructed hash tables.

Search Algorithm: To find a point at distance at
most $cr$ from a query point $q$ given that there is a
neighbor at distance at most $r$ from $q$, pick $O(n^\rho)$
random points $v$ from $B(q,r)$ and search in the buckets
$H(v)$. Here $\rho = M/\log(1/g)$.
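The construction and search steps can be sketched as follows. This is only an illustration of the control flow: the hash `H` is passed in as an arbitrary callable, and `toy_hash` (a coarse axis-aligned grid) is a hypothetical stand-in chosen so the example is deterministic, not one of the locality preserving families analyzed in the paper. The names `build_table`, `random_point_on_sphere`, `search` and the parameter `num_probes` are likewise assumptions.

```python
import math
import random
from collections import defaultdict

def build_table(points, H):
    """Store each point index in the bucket H(p); only nonempty buckets are kept."""
    table = defaultdict(list)
    for idx, p in enumerate(points):
        table[H(p)].append(idx)
    return table

def random_point_on_sphere(q, r, rng):
    """Uniform random point on B(q, r), the sphere of radius r around q."""
    g = [rng.gauss(0.0, 1.0) for _ in q]
    norm = math.sqrt(sum(x * x for x in g))
    return [qi + r * gi / norm for qi, gi in zip(q, g)]

def search(q, r, c, points, table, H, num_probes, rng):
    """Probe the buckets H(v) of random points v on B(q, r).

    Returns the index of some point within distance c * r of q, or None.
    """
    for _ in range(num_probes):
        v = random_point_on_sphere(q, r, rng)
        for idx in table.get(H(v), ()):
            if math.dist(points[idx], q) <= c * r:
                return idx
    return None

# Stand-in hash for illustration only: a coarse axis-aligned grid.
def toy_hash(p, cell=4.0):
    return tuple(math.floor(x / cell) for x in p)

points = [[1.0, 1.0], [10.0, 10.0]]
table = build_table(points, toy_hash)
rng = random.Random(0)
found = search([1.0, 1.5], 0.5, 2.0, points, table, toy_hash, 8, rng)
missed = search([30.0, 30.0], 0.5, 2.0, points, table, toy_hash, 8, rng)
```

In the paper's setting `num_probes` would be on the order of $n^\rho$ and `H` would be built from $k$ random hash functions; here both are left to the caller.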

Theorem 1 With probability at least $\Omega(1)$, if the
nearest neighbor of the query point is at distance $r$, the
search algorithm finds a neighbor at distance at most $cr$.
With constant probability, no more than $O(n^\rho)$ time is
spent searching points that are at a distance more than $cr$
from $q$.

By using polylog $n$ hash tables our algorithms can be
made to succeed with high probability.

Alternatively, we show that our methods can be used
to construct a data structure of size $O(n^{1/(1-\rho)})$ to
answer queries in $O(d)$ time.

By applying this analysis to the locality preserving
hash functions from […] and adjusting the parameters
we show that the $c$-approximate nearest neighbor can be
computed in time $O(n^\rho)$ and near linear space where
$\rho \approx 2.09/c$ as $c$ becomes large. For $c = 2$, $\rho$
turns out to be about 0.69.



We start in section 3 with preliminaries including a
crucial lemma that states the number of random samples
required to guess the specific value of an arbitrary random
variable. To simplify the exposition of the basic
principles, in section 4 we study a random instance of
the nearest neighbor problem in Euclidean space where
the points in the database are chosen randomly from a
spherical Gaussian distribution. In section 5 we prove
the main theorems applicable to nearest neighbor search
for arbitrary point sets and derive algorithms for nearest
neighbor search in Euclidean space. Finally, in section 6
we discuss some computational issues relevant for
practical implementation.

3 Preliminaries 

First let us go through some notation.

• $N(p,r)$, $\eta(x)$: Let $N(p,r)$ denote the normal
distribution with mean $p$ and variance $r^2$, with
probability density function given by
$\frac{1}{r\sqrt{2\pi}} e^{-(x-p)^2/(2r^2)}$.
Let $\eta(x)$ denote the function $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$.

• $N^d(p,r)$: For the $d$-dimensional Euclidean space, for
a point $p = (p_1, p_2, \ldots, p_d) \in \mathbb{R}^d$ let
$N^d(p,r)$ denote the normal distribution in $\mathbb{R}^d$
around the point $p$ where the $i$th coordinate of a random
point has the normal distribution $N(p_i, r/\sqrt{d})$ with
mean $p_i$ and variance $r^2/d$. It is well known that this
distribution is spherically symmetric around $p$. A point from
this distribution is expected to be at root-mean-squared
distance $r$ from $p$; in fact, for large $d$ its distance
from $p$ is close to $r$ with high probability (see for
example lemma 6 in [17]).

• $\mathrm{erf}(x)$, $\Phi(x)$: The well-known error function
$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,dt$
is equal to the probability that a random variable from
$N(0, 1/\sqrt{2})$ lies between $-x$ and $x$. Let
$\Phi(x) = \frac{1 - \mathrm{erf}(x/\sqrt{2})}{2}$. For
$x > 0$, $\Phi(x)$ is the probability that a random variable
from the distribution $N(0, 1)$ is greater than $x$.

• Use $\alpha \approx 1.303$ to denote the constant
$\int_0^\infty I(\Phi(x), 1 - \Phi(x))\,dx$. The approximate
value of this integral has been computed using Matlab.

• Projection: We will use the following commonly
used projections that map points in Euclidean space
to real numbers. Let $v$ denote a random vector
from the distribution $N^d(0, \sqrt{d})$. Then for any point
$p \in \mathbb{R}^d$, the projection $f(p) = v \cdot p$ is
distributed according to the normal distribution
$N(0, \|p\|)$ where $\|p\|$ is the Euclidean distance of $p$
from the origin. Several such projections can be used to
project a point $p$ into a low (say $k$) dimensional space;
for example, we can have the function
$F(p) = (f_1(p), f_2(p), \ldots, f_k(p))$ for random choices of
projection functions $f_1, \ldots, f_k$.
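The projection in the last item can be checked empirically. The sketch below assumes that a vector from $N^d(0, \sqrt{d})$ has independent coordinates of unit standard deviation (this follows from the per-coordinate deviation $r/\sqrt{d}$ in the definition above). The helper names, the seed and the sample size are illustrative choices.

```python
import math
import random

def random_projection(d, rng):
    """Return f(p) = v . p for a random v with i.i.d. N(0, 1) coordinates."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return lambda p: sum(vi * pi for vi, pi in zip(v, p))

def projected_differences(p, q, trials, rng):
    """Sample f(p) - f(q) over independently drawn projections f."""
    out = []
    for _ in range(trials):
        f = random_projection(len(p), rng)
        out.append(f(p) - f(q))
    return out

rng = random.Random(0)
p = [3.0, 0.0, 4.0, 1.0]
q = [0.0, 0.0, 0.0, 1.0]          # d(p, q) = 5
diffs = projected_differences(p, q, 20000, rng)
# The differences have mean 0, so the root mean square estimates the deviation.
sample_std = math.sqrt(sum(x * x for x in diffs) / len(diffs))
print(sample_std)  # should be close to d(p, q) = 5
```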

The following are well known facts about such random
projections (they are direct consequences of the
2-stability of the normal distribution [28]):

Fact 1 Under a random projection described above, for
any points $p$ and $q$, $F(p) - F(q)$ has the distribution
$N^k(0, d(p,q))$ where $d(p,q)$ denotes the distance between
$p$ and $q$. So the distribution of $F(p) - F(q)$ depends only
on the distance $d(p,q)$ and not on the positions of $p$ and
$q$.

Fact 2 If $r$ is a random point on $B(p,x)$, then
$F(r) - F(q)$ has the distribution
$N^k(0, \sqrt{d(p,q)^2 + x^2})$.

Guessing the value of a random variable: If a random 
variable takes one of N discrete values with equal prob- 
ability then a simple coupon collection based argument 
shows that if we guess N random values at least one of 
them should hit the correct value with constant proba- 
bility. The following lemma states the required number 
of samples for arbitrary random variables so as to 'hit' a 
given random value of the variable. It essentially states 
how many guesses are required to guess the value of a 
random variable. 

Lemma 2 Given a random instance $x$ of a discrete
random variable with a certain distribution $\Omega$ with
entropy $I$, if $O(2^I)$ random samples are chosen from this
distribution, at least one of them is equal to $x$ with
probability at least $\Omega(1/I)$.

Proof: Let $w_1, w_2, \ldots, w_N$ denote the probability
distribution $\Omega$ of the discrete space.

After $s = 4(2^I + 1)$ samples the probability that $x$ is
chosen is $\sum_i w_i (1 - (1 - w_i)^s)$.

If $w_i \ge 1/s$ then the term in the sum is at least
$w_i(1 - 1/e)$. So if all the $w_i$'s that are at least $1/s$
add up to at least $1/I$ then the above sum is at least
$\Omega(1/I)$. Otherwise we have a collection of $w_i$'s each
of which is at most $1/s$ and they together add up to more
than $1 - 1/I$.

But then by paying attention to these probabilities we see
that the entropy
$I = \sum_i w_i \log(1/w_i) \ge \sum_{i: w_i \le 1/s} w_i \log s > (1 - 1/I)\log s > (1 - 1/I)(I + 2) = I + 1 - 2/I$.
For $I \ge 4$, this is strictly greater than $I$, which is a
contradiction. If $I < 4$ then the largest $w_i$ must be at
least $1/16$ as otherwise a similar argument shows that
$I = \sum_i w_i \log(1/w_i) > \sum_i w_i \log 16 = 4$, a
contradiction; so in this case even one sample guesses $x$
with constant probability. □
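The quantity bounded in this proof can be evaluated exactly for a small distribution. The sketch below computes the entropy in bits, the sample count $s = 4(2^I + 1)$ from the proof, and the hit probability $\sum_i w_i (1 - (1 - w_i)^s)$. The example distribution `heavy` (one value of probability 1/2 plus 1024 equally likely light values) and the function names are illustrative assumptions.

```python
import math

def entropy_bits(w):
    """Entropy of the distribution w in bits."""
    return sum(-x * math.log2(x) for x in w if x > 0)

def hit_probability(w, s):
    """Probability that at least one of s i.i.d. samples from w equals an
    independent draw x from w: sum_i w_i * (1 - (1 - w_i)**s)."""
    return sum(x * (1.0 - (1.0 - x) ** s) for x in w)

def samples_from_lemma(w):
    """Sample count s = 4 * (2**I + 1) used in the proof of Lemma 2."""
    return math.ceil(4 * (2 ** entropy_bits(w) + 1))

# Skewed example: one heavy value plus many light ones; its entropy is 6 bits.
heavy = [0.5] + [0.5 / 1024] * 1024
s = samples_from_lemma(heavy)
p_hit = hit_probability(heavy, s)
print(entropy_bits(heavy), s, p_hit)
```

For this distribution the hit probability is dominated by the heavy value, which is the situation described in the second half of Remark 2 below: far fewer than $2^I$ samples would already succeed with constant probability.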

Remark 1 While the above lemma assumes that the
random samples are chosen from the same distribution
from which $x$ was derived, it is easy to extend it to the
case where random samples are chosen from a distribution
slightly different from $\Omega$, where say the probabilities
of corresponding events differ at most by a constant factor.
For example the random samples could be chosen
from a distribution $\Omega' = (w'_1, w'_2, \ldots, w'_N)$ where
the individual probabilities differ from the ones in the
distribution $\Omega = (w_1, w_2, \ldots, w_N)$ by at most a
constant multiplicative factor.

Remark 2 The above result is tight to the extent that
you cannot get a probability much better than $\Omega(1/I)$
with $O(2^I)$ samples. There is a distribution with entropy
$I$ so that even picking $O(I 2^I)$ samples will hit $x$ only
with probability $O(\log I/I)$. The distribution has one
element with probability $4 \log I/I$ and all others with equal
probability of $O(1/(I^2 2^I))$. The converse of the lemma
is not necessarily true. That is, there may be a
distribution with entropy $I$ for which it is sufficient to
pick far fewer than $2^I$ samples, in fact just one sample,
to hit $x$ with significant probability. Think of a
distribution where one element has probability 1/2 and there
are an exponentially large number of remaining elements with
tiny uniform probability.

4 Random Instance in Euclidean 
Space 

We study a random instance of the problem where each
point is distributed according to $N^d(0, 1/\sqrt{2})$. The
reason we choose this distribution with a deviation of
$1/\sqrt{2}$ is that the expected distance between any two
points is 1; in fact, the distance is very close to 1 with
high probability for large $d$. The query point is randomly
chosen around a certain point $p$ with distribution
$N^d(p, 1/c)$; the query point is at distance close to $1/c$
from its nearest neighbor. The idea is to use the random
projections to a real line introduced earlier.

For two points separated by distance $x$, the distance in
the projection is distributed as $N(0, x)$. We use $k = \log n$
such projections. For each point $p$ this gives a vector of
real numbers $F(p) = (f_1(p), f_2(p), \ldots, f_k(p))$. For each
projection we produce a bit $h_i(p) = 0$ if $f_i(p) < 0$ and
1 otherwise, giving $H(p) = (h_1(p), h_2(p), \ldots, h_k(p))$.
This hashes each point to an element of $\{0, 1\}^k$. If
$k = \log n$, the number of points in any one hash bucket
(bin) is at most $\log n$ with high probability.
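The bit hash just described is short enough to write out. This is a minimal sketch: `make_bit_hash` and `hamming` are hypothetical names, the projections use i.i.d. standard normal coordinates as in the preliminaries, and $\log n$ is taken base 2.

```python
import math
import random

def make_bit_hash(d, k, rng):
    """Draw k random projections; return F (real vector) and H (bit vector)."""
    vs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]

    def F(p):
        return [sum(vi * pi for vi, pi in zip(v, p)) for v in vs]

    def H(p):
        # h_i(p) = 0 if f_i(p) < 0 and 1 otherwise.
        return tuple(0 if f < 0 else 1 for f in F(p))

    return F, H

def hamming(a, b):
    """Number of positions in which two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(1)
n, d = 1024, 16
k = int(math.log2(n))          # k = log n = 10 projections
F, H = make_bit_hash(d, k, rng)
p = [rng.gauss(0.0, 1.0) for _ in range(d)]
neg_p = [-x for x in p]        # the antipodal point flips every bit
print(H(p), hamming(H(p), H(neg_p)))
```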

Unfortunately, the query point $q$ may not hash to
the same bucket as its nearest neighbor $p$. We will
try to guess $H(p)$. It can be shown that the hash
values $H(p)$ and $H(q)$ are expected to differ in about
an $O(1/c)$ fraction of the bits. Based on this fact we may
need to search a large number of hash buckets, up to
$\binom{k}{k/c} \approx n^{I(1/c,\, 1-1/c)} \approx n^{O(\log c/c)}$
for large $c$.

Our essential observation is that this search space can
be pruned significantly by paying attention to the vector
$F(q)$ from which $H(q)$ is derived. If a coordinate $f_i(q)$
is far from 0, it is less likely that $h_i(p)$ and $h_i(q)$ will
differ. In fact, if the absolute value $|f_i(q)| = x$ then for
$h_i(p)$ and $h_i(q)$ to differ, the projection $f_i$ must map
$p$ at least $x$ away from $q$. This happens with probability
at most $e^{-O(x^2c^2)}$. This is exponentially small in $c$
except when $x$ is comparable to $1/c$, which happens only with
probability about $1/c$. So the search space of $H(p)$ given
$F(q)$ is much smaller than $\binom{k}{k/c}$. To estimate the
size of this search space precisely we compute the entropy
of $H(p)$ given $F(q)$. If this is $M$, then by lemma 2 the
search space is about $O(2^M)$.

Now, $I(H(p) \mid F(q)) \le \sum_i I(h_i(p) \mid f_i(q))$. Let
us first compute $I(h(q) \mid f(p))$ for one random projection.

Lemma 3 If $p$ is a random point on $B(0, 1/\sqrt{2})$ and
$q$ is a random point on $B(p, 1/c)$, then for a random
projection
$I(h(q) \mid f(p)) = \frac{1}{c}(1 - o(1))\,2\alpha/\sqrt{\pi} \approx 1.47/c$.

Proof: $f(p)$ has the distribution $N(0, 1/\sqrt{2})$, and
$f(q) - f(p)$ has the distribution $N(0, 1/c)$. So $f(p)$ is at
distance $x$ from 0 with probability density
$2\sqrt{2}\,\eta(\sqrt{2}x)$ and in that case the probability
that $f(q)$ is not on the same side of 0 as $f(p)$ is
$\Phi(xc)$, so the entropy of $h(q)$ is
$I(\Phi(cx), 1 - \Phi(cx))$. So

$$
\begin{aligned}
I(h(q) \mid f(p)) &= \int_0^\infty 2\sqrt{2}\,\eta(\sqrt{2}x)\, I(\Phi(cx), 1 - \Phi(cx))\,dx \\
 &= \frac{2}{\sqrt{\pi}} \int_0^\infty e^{-x^2} I(\Phi(cx), 1 - \Phi(cx))\,dx \\
 &= \frac{2}{\sqrt{\pi}\,c} \int_0^\infty e^{-(x/c)^2} I(\Phi(x), 1 - \Phi(x))\,dx
\end{aligned}
$$

Now $\Phi(x) < e^{-x^2/2}/x$ drops exponentially and for
large $c$, $e^{-(x/c)^2}$ drops slowly and is close to 1 until $x$
becomes comparable to $c$. So
$\int_0^\infty e^{-(x/c)^2} I(\Phi(x), 1 - \Phi(x))\,dx = (1 - o(1)) \int_0^\infty I(\Phi(x), 1 - \Phi(x))\,dx$.

□
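The constants in Lemma 3 can be checked numerically. The sketch below estimates $\alpha = \int_0^\infty I(\Phi(x), 1 - \Phi(x))\,dx$ with a midpoint rule, assuming entropy is measured in bits, and then forms $2\alpha/\sqrt{\pi}$. The truncation point `upper` and the step count are arbitrary choices that are more than fine enough for three digits; the function names are illustrative.

```python
import math

def Phi(x):
    """Upper tail of the standard normal: P[N(0, 1) > x]."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def binary_entropy(w):
    """I(w, 1 - w) in bits, with 0 * log 0 taken as 0."""
    total = 0.0
    for t in (w, 1.0 - w):
        if t > 0.0:
            total -= t * math.log2(t)
    return total

def alpha(upper=10.0, steps=20000):
    """Midpoint-rule estimate of the integral of I(Phi(x), 1 - Phi(x)) on [0, upper]."""
    h = upper / steps
    return h * sum(binary_entropy(Phi((i + 0.5) * h)) for i in range(steps))

a = alpha()
rho_coeff = 2.0 * a / math.sqrt(math.pi)   # coefficient of 1/c in Lemma 3
print(a, rho_coeff)
```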

Similarly it can be shown that
$I(h(p) \mid f(q)) = I(h(q) \mid f(p)) \approx 1.47/c$ (see
the appendix). So $I(H(p) \mid F(q)) \le 1.47k/c$. But
$I(H(p) \mid F(q))$ is the expected entropy of $H(p)$ given
$F(q)$ for random choices of $q$ from $B(p, 1/c)$. We will
argue that for large $d$, even for a fixed random choice of
$q$ and $F$, $I(H(p) \mid F(q)) \le (1 + o(1))1.47k/c$ with high
probability $1 - o(1)$. Observe that if $d > k$ the tuples
$(f_i(q), f_i(p))$ are independent for the $k$ different values
of $i$; so the sum $\sum_i I(h_i(p) \mid f_i(q))$ is a sum of
independent random variables in the range $[0,1]$ each with
expectation $1.47/c$. By Chernoff bounds, with high
probability the sum will be close to the mean. Even if
$d < k$ the terms are $d$-wise independent and Chernoff bounds
may be applied to $d$ terms at a time; the high probability
bound follows if we assume $d$ is large. This means by lemma 2
that with high probability $1 - o(1)$, the search time is about
$2^{(1+o(1))1.47k/c}$, which is $n^{(1+o(1))1.47/c}$.

The algorithm is as follows: For $n^{(1+o(1))1.47/c}$
iterations: Search a random bucket from the distribution of
$H(p)$ given $F(q)$. Report the nearest neighbor among all
points searched.

Note that $F(p)$ given $F(q)$ has a normal distribution
(appendix 7.1) and so sampling with the same distribution
as $H(p)$ given $F(q)$ is easy. This gives an algorithm
that takes near linear space and $n^{(1+o(1))1.47/c}$ time.
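One way the sampling step could look is sketched below. It is not taken from the appendix (which is not reproduced here); it assumes, per coordinate, that $f(p)$ has deviation $1/\sqrt{2}$ and that $f(q) - f(p)$ is independent noise of deviation $1/c$, and then applies the standard Gaussian conditioning formula. The function names are hypothetical, and the coordinates are treated as independent.

```python
import math
import random

def posterior_params(y, c):
    """Mean and deviation of f(p) given f(q) = y, assuming f(p) has variance 1/2
    and f(q) - f(p) is independent with variance 1/c**2."""
    var_p, var_e = 0.5, 1.0 / (c * c)
    mean = y * var_p / (var_p + var_e)
    std = math.sqrt(var_p * var_e / (var_p + var_e))
    return mean, std

def prob_bit_one(y, c):
    """P[h(p) = 1 | f(q) = y], where h(p) = 1 iff f(p) >= 0."""
    mean, std = posterior_params(y, c)
    return 0.5 * math.erfc(-mean / (std * math.sqrt(2.0)))

def sample_bucket(Fq, c, rng):
    """Draw one candidate value of H(p) given the projection vector F(q)."""
    return tuple(1 if rng.random() < prob_bit_one(y, c) else 0 for y in Fq)

bucket = sample_bucket([3.0, -3.0, 0.1], 4.0, random.Random(0))
print(bucket)
```

Coordinates of $F(q)$ far from 0 are effectively fixed, so the randomness (and the entropy) is concentrated on the few coordinates with $|f_i(q)|$ comparable to $1/c$.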

Remark 3 In the decision version of the nearest neighbor
problem we assumed that we know the exact distance
$1/c$ to the nearest neighbor whereas in earlier works, $1/c$
is only an upper bound on the distance to the nearest
neighbor. This can easily be fixed by guessing the exact
distance within a factor of $1 + \epsilon$ where
$\epsilon = O(1/\log n)$. Then the sampled hash value has
almost the same probability distribution as $H(p)$ for the
nearest neighbor $p$ of $q$, and it follows from remark 1
that we can still apply lemma 2 to achieve the same
result. The search time only increases by a factor of
$O(\log n)$.

Remark 4 In our search data structure, we have used
a set of $\log n$ random hyperplanes to separate the $n$
points of the database. It can be shown that if the points
can be separated by 'thick' hyperplanes, say $\log n$ almost
orthogonal hyperplanes of thickness at least $t$, then
I(h(p)\q, h) — e~°' c /' ) implying a much faster search 

time of n e if t is not too large. While such thick 

hyperplanes exist for large dimensions when $d > n$ (see
appendix 7.3), for $d \ll n$ a simple probabilistic
calculation shows that such a set of thick separating
hyperplanes does not exist.

Note that for large $d$, we need not store the entire
description of each point in the hash table but only its
$O(\log n)$ bit hash value. With high probability, this
should be sufficient to distinguish points that
are $1/c$ close to the query point from points that are at
least 1 away.

Alternatively we will show later in section 5 how this
technique can also be used to search in $O(d)$ time and
$n^{(1+o(1))/(1-1.47/c)}$ space.

Although we have assumed that the points are chosen 
randomly from a normal distribution, our results in this 
section can be applied to any set of points whose pairwise 
distances are about the same. This is true when points 
are chosen randomly from other distributions such as 
from a cube. In that case we can set the origin to be the 
centroid of the point set. It can easily be shown that the 
distance of any point from the centroid is about $1/\sqrt{2}$ of
the interpoint distance. 

5 Generalizing to arbitrary set of 
points 

5.1 Proof of Main theorems 

We now generalize our techniques to an arbitrary set of
points. Assume without loss of generality that the nearest
neighbor of the query point is at distance $1/c$ from the
query point, and we are interested in finding any point at
distance at most 1 from the query point. We use locality
preserving hash functions to map database points into a
hash table. To find the approximate nearest neighbor of
a query point, we hash several randomly chosen points
in the $1/c$-neighborhood of the query point and show
that a $(1/c, 1)$-nearest neighbor is likely to be present
in one of these buckets. Let $M$ denote the entropy
$I(h(p) \mid q, h)$ where $p$ is a random point in $B(q, 1/c)$.
Here $I(h(p) \mid q, h)$ denotes the entropy of $h(p)$ given the
query point and the specific hash function from the hash
family in use. Let $g$ denote an upper bound on the
probability that two points that are at least distance 1 apart
will hash to the same bucket. Pick $k = \log n/\log(1/g)$
random hash functions $h_1, h_2, \ldots, h_k$ (after random
rotations and shifts) and store each point $p$ in the database
in the bucket $H(p) = (h_1(p), h_2(p), \ldots, h_k(p))$. Since
many buckets may be empty we use hashing to only store the
non-empty buckets.



First observe that after a random rotation and a random
shift of the origin, the nearest neighbor of $q$ appears
like a random point $p$ on $B(q, 1/c)$ (this rotation and
shift may not be required as the hash functions may
already perform them implicitly, see section 5.2). We will
show how to guess $H(p)$ in time
$O(n^{M(1+1/\log n)/\log(1/g)})$. Since we are only interested
in running times where the exponent of $n$ is at most 1, this
is $O(n^{M/\log(1/g)})$.

Now $I(H(p) \mid q, H) \le \sum_i I(h_i(p) \mid q, h_i) = kM$.
This means on average at most $kM$ bits are required
to guess $H(p)$ for a given set $H$ of $k$ random hash
functions. $I(H(p) \mid q, H)$ also denotes the expected value
of $I(H(p) \mid q)$ under random choices for fixing the set
$H$ of hash functions. So for a fixed $H$, by Markov's
inequality with probability at least $1/\log n$, this entropy
is at most $kM(1 + 1/\log n)$. Let us assume this is the
case. We are now ready to prove theorem 1.

Proof: [of theorem 1] For a given set $H$ of $k$ hash
functions lemma 2 implies that by picking
$2^{kM(1+1/\log n)}$ random values with the same distribution
as $H(p)$, at least one of them is equal to $H(p)$ with
probability at least $\Omega(1/(kM))$. So with one hash table,
with probability at least $\Omega(1/(kM))$, we can find $H(p)$
in time $2^{kM(1+1/\log n)}$. Also picking random values with
the same distribution as $H(p)$ is easy: just compute $H(v)$
where $v$ is a random point from $B(q, 1/c)$. So by lemma 2,
by searching $O(2^{kM(1+1/\log n)})$ buckets obtained by
applying $H$ on randomly chosen points $v$ in $B(q, 1/c)$, with
probability at least $\Omega(1/(kM))$ we find the nearest
neighbor $p$. Setting $k = \log n/\log(1/g)$ gives us the
desired result.

We also need to bound the number of far off points
visited over the $O(2^{kM(1+1/\log n)})$ buckets searched. For
any point $t$ that is at least distance 1 from $q$, the
probability that it is visited in one bucket is at most $g^k$.
So out of $n$ such possible points the expected number of
such points visited over all buckets is at most
$n g^k \cdot O(2^{kM(1+1/\log n)}) = O(2^{kM(1+1/\log n)})$,
since $n g^k = 1$ for our choice of $k$. So with
probability 1/2 at most twice as many far off points are
visited.

□

So in the end the algorithm is simple: Pick
$O(2^{kM(1+1/\log n)})$ random points from $B(q, 1/c)$. Search
the buckets these points hash to, limiting the total number
of points visited at distance more than 1 from $q$ to at
most $O(2^{kM(1+1/\log n)})$. Repeat this for polylog $n$ hash
tables and pick the nearest found neighbor.

Alternatively by storing $p$ in buckets obtained by
applying $H$ on $2^{kM(1+\epsilon)}$ randomly chosen points from
$B(p, 1/c)$, we can have a small search time with slightly
more space. For a fixed random choice of $H$, by
Markov's inequality the probability that $I(H(q) \mid p)$ is
at most $kM(1+\epsilon)$ is at least $\epsilon/(1+\epsilon)$. By
lemma 2, with probability at least
$\Omega(1/(kM(1+\epsilon)))$ the query point will hash to
one of these buckets. Again, how many far off points
can be present in this bucket? A given point $t$ in the
database that is at distance at least 1 away from $q$ will
be stored in $2^{kM(1+\epsilon)}$ buckets. These buckets are
$H(v)$ for $O(2^{kM(1+\epsilon)})$ randomly chosen values $v$
picked from $B(t, 1/c)$. Again if $g$ denotes an upper bound on
the probability that one such random point $v$ hashes to the
same value as $q$ under a single hash function, then the
probability that for one such $v$, $H(v) = H(q)$ is at most
$g^k$.

So over $O(2^{kM(1+\epsilon)})$ choices of $v$ from
$B(t, 1/c)$, the probability that any of these hash to the
same bucket as $q$ is at most $2^{kM(1+\epsilon)} g^k$. Out of
the $n$ points in the database the expected number of points
that hash to the same bucket as $q$ is at most
$n 2^{kM(1+\epsilon)} g^k$. We choose $k$ so that this is at
most 1, giving $k = \log n/(\log(1/g) - M(1+\epsilon))$. So by
Markov's inequality the probability that more than 2 points at
distance at least 1 from $q$ hash to the same bucket as $q$ is
at most 1/2. Again by using $O(\log n)$ hash tables, with high
probability for at least one of them not more than 2 far off
points will be searched. We limit the search in each bucket to
at most 3 points. Here the size of the hash table is
$O(n 2^{kM(1+\epsilon)}) = O(n^{\log(1/g)/(\log(1/g) - M(1+\epsilon))}) = O(n^{1/(1-\rho(1+\epsilon))})$.
This is $O(n^{1/(1-\rho)})$ if $\epsilon = (1-\rho)^2/\log n$.
The total success probability is
$\Omega(\epsilon/(kM(1+\epsilon)))$, which is
$\Omega((1-\rho)^3)$ up to polylogarithmic factors in $n$.

So we have proved the following theorem. 

Theorem 4 With probability at least $\Omega((1-\rho)^3)$, if we
use $k = \log n/(\log(1/g) - M(1+\epsilon))$ projections, using a
hash table of size $O(n^{1/(1-\rho)})$ the search algorithm
succeeds for one hash table. With constant probability, no
more than $O(1)$ points that are at a distance more than
1 from $q$ are searched.

Again, by using polylogn hash tables the algorithm 
can be made to succeed with high probability. 
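The parameter choices in the two theorems reduce to a little arithmetic, sketched below. The values of `M` and `g` in the example are purely illustrative and not taken from the paper; logarithms are assumed base 2 so that $M$ is measured in bits, and the formulas require $\rho(1+\epsilon) < 1$. The function name `lsh_parameters` is hypothetical.

```python
import math

def lsh_parameters(n, M, g, eps=0.0):
    """Return (rho, k_query, k_space, space_exponent).

    rho            = M / log(1/g)
    k_query        = log n / log(1/g)                 (near linear space variant)
    k_space        = log n / (log(1/g) - M(1 + eps))  (fast query variant)
    space_exponent = 1 / (1 - rho(1 + eps)), the exponent of n in the table size
    """
    log_inv_g = math.log2(1.0 / g)
    rho = M / log_inv_g
    k_query = math.log2(n) / log_inv_g
    k_space = math.log2(n) / (log_inv_g - M * (1.0 + eps))
    space_exponent = 1.0 / (1.0 - rho * (1.0 + eps))
    return rho, k_query, k_space, space_exponent

# Illustrative numbers: n = 2^20 points, M = 1 bit of entropy, g = 1/4.
rho, k_query, k_space, space_exp = lsh_parameters(2 ** 20, 1.0, 0.25)
print(rho, k_query, k_space, space_exp)
```

With these illustrative numbers the near linear space variant probes about $n^{\rho} = 2^{10}$ buckets, while the fast query variant stores about $n^{2}$ entries.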

5.2 Choice of Hash functions for Eu- 
clidean Space 

We now apply our techniques to the locality preserving
hash functions for Euclidean space […].

Instead of mapping $f(p)$ to a bit we map it to an
integer. As in […], divide the real line into equal sized
intervals of size $D$ and add a random shift. Precisely, the
point $p$ is hashed to an integer
$h(p) = \lfloor (f(p) + \beta)/D \rfloor = \lfloor (p \cdot v + \beta)/D \rfloor$
where $v$ is a random vector from the distribution
$N^d(0, \sqrt{d})$ and $\beta$ is a random number in $[0, D]$.
$H(p) = (h_1(p), h_2(p), \ldots, h_k(p))$. Essentially $H$
divides the space $\mathbb{R}^k$ into a grid of cubes of side
length $D$.

Let $r_i(p) = (f_i(p) + \beta_i) \bmod D$. So
$R(p) = (r_1(p), r_2(p), \ldots, r_k(p))$ denotes the relative
position of $F(p)$ within its cube. $R(p)$ is uniformly
distributed in $[0, D]^k$. We will later set $D$ to be about 3.
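The grid hash and the relative position can be sketched in a few lines. This is a minimal illustration under the same assumption as before that $v$ has i.i.d. standard normal coordinates; the name `make_grid_hash` and the test point are hypothetical, and $D = 3$ follows the value mentioned above.

```python
import math
import random

def make_grid_hash(d, k, D, rng):
    """Draw k projections v_i and shifts beta_i; return (shifted, H, R)."""
    vs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]
    betas = [rng.uniform(0.0, D) for _ in range(k)]

    def shifted(p):
        # f_i(p) + beta_i for each of the k projections.
        return [sum(vi * pi for vi, pi in zip(v, p)) + b
                for v, b in zip(vs, betas)]

    def H(p):
        # h_i(p) = floor((f_i(p) + beta_i) / D): index of the grid cell.
        return tuple(math.floor(t / D) for t in shifted(p))

    def R(p):
        # r_i(p) = (f_i(p) + beta_i) mod D: position inside the cell.
        return tuple(t - D * math.floor(t / D) for t in shifted(p))

    return shifted, H, R

D = 3.0
shifted, H, R = make_grid_hash(8, 5, D, random.Random(2))
p = [0.3 * i for i in range(8)]
print(H(p), R(p))
```

Each coordinate satisfies $h_i(p) \cdot D + r_i(p) = f_i(p) + \beta_i$, which is what lets the search reason about $H(p) - H(q)$ from the position $R(q)$ alone.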

Now consider two points $p$ and $q$ that are distance
$1/c$ apart. We will try to guess the relative position
of $p$'s subcube $H(p)$ from $q$'s subcube $H(q)$, given the
position $R(q)$ of $q$ in its subcube; that is, we will try
to guess $H(p) - H(q)$. Under the $k$ random projections,
$F(p) - F(q)$ is randomly distributed according to
$N(0, 1/c)^k$ and is independent of the relative position
$R(q)$ in its cube, as the alignments of the intervals are
independent of the projections $f_i$. The time required to guess
$H(p) - H(q)$ depends on the entropy of $H(p) - H(q)$
given $R(q)$.

The following lemma computes $I(h_i(p) - h_i(q) \mid r_i(q))$.

Lemma 5 If $p$ and $q$ are distance $1/c$ apart then under
a random projection, $I(h(p) - h(q) \mid r(q)) = M =
(1 + e^{-O(c^2 D^2)})\, 2a/(cD)$, where
$a = \int_0^\infty I(\Phi(x), 1 - \Phi(x))\, dx$.

Proof: $r(p)$ is a random value in $[0, D]$; so the probability
density that it takes value $x$ in $[0, D]$ is $1/D$.
$h(p) - h(q)$ takes integral values; however, as $c$ becomes
large, in terms of its entropy most of it is concentrated
at 1 and $-1$. Let $M_i$ denote $I(h(p) - h(q) = i \mid r(p))$. We
are interested in the sum of $M_i$ over all integers $i$. By
symmetry $M_i = M_{-i}$. If $r(p) = D - x$, then
$\Pr[h(p) - h(q) = 1] = \Phi(cx) - \Phi(cx + cD)$.
So

$$M_1 = I(h(p) - h(q) = 1 \mid r(p))
      = \frac{1}{D} \int_0^{D} I(\Phi(cx) - \Phi(cx + cD))\, dx
      = \frac{1}{cD} \int_0^{Dc} I(\Phi(x) - \Phi(x + cD))\, dx.$$

Again as $\Phi(x)$ drops exponentially, $\Phi(x + cD)$ is
negligible as compared to $\Phi(x)$, and further the
integral to $\infty$ is not much more than the integral
to $Dc$. So

$$\int_0^{Dc} I(\Phi(x) - \Phi(x + cD))\, dx = (1 - e^{-O(c^2 D^2)}) \int_0^{\infty} I(\Phi(x))\, dx.$$

We have shown that $M_1$ (and $M_{-1}$) $= (1 -
e^{-O(c^2 D^2)}) \frac{1}{cD} \int_0^{\infty} I(\Phi(x))\, dx$. Also $M_i$ drops
exponentially with $i$ since for a given value of $r(p)$,
$\Pr[h(p) - h(q) = i]$ drops exponentially with a factor of
$e^{-(Dc)^{|i|-1}}$.



$$M_0 = \frac{2}{D} \int_0^{D/2} I(1 - \Phi(cx) - \Phi(Dc - xc))\, dx
      = \frac{2}{cD} \int_0^{Dc/2} I(1 - \Phi(x) - \Phi(Dc - x))\, dx.$$

Again as before we argue that in the range $[0, Dc/4]$,
$\Phi(Dc - x)$ is negligible as compared to $\Phi(x)$, and beyond
that they are both negligible ($e^{-O(c^2 D^2)}$). This gives us,

$$M_0 = (1 + e^{-O(c^2 D^2)}) \frac{2}{cD} \int_0^{\infty} I(1 - \Phi(x))\, dx.$$

So,

$$\sum_i M_i = (1 + e^{-O(c^2 D^2)})(M_{-1} + M_0 + M_1)
            = (1 + e^{-O(c^2 D^2)}) \frac{2}{cD} \int_0^{\infty} I(\Phi(x), 1 - \Phi(x))\, dx.$$

□ 
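The constant $a$ has no closed form, but it is easy to estimate numerically. The sketch below is an illustration under stated assumptions: $I(\cdot, \cdot)$ is taken to be the binary entropy in bits, $\Phi$ is taken to be the upper tail of the standard normal (consistent with the remark that $\Phi(x)$ drops exponentially), and the function names and integration cutoff are hypothetical choices. The estimate comes out close to 1.3.

```python
import math

def binary_entropy(p):
    """Entropy in bits of the two-outcome distribution (p, 1 - p)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gaussian_tail(x):
    """Upper tail of the standard normal, Pr[Z > x]."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def estimate_a(upper=12.0, steps=24000):
    """Trapezoidal estimate of the integral over [0, upper] of the
    binary entropy of the Gaussian tail; the part beyond upper is negligible."""
    dx = upper / steps
    total = 0.5 * (binary_entropy(gaussian_tail(0.0))
                   + binary_entropy(gaussian_tail(upper)))
    for i in range(1, steps):
        total += binary_entropy(gaussian_tail(i * dx))
    return total * dx

a = estimate_a()
print(a)
```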

The following lemma computes the probability g that 
a point t at distance at least 1 from q hashes to the same 
value h(t) as h(q) under one projection. 

Lemma 6 $g = 1 - \frac{\sqrt{2/\pi}}{D} (1 - e^{-D^2/2})$

Proof: If $f(t)$ and $f(q)$ are $x$ apart then the probability
that they are separated by the interval boundaries
is $x/D$. A simple computation shows that the probability
$1 - g$ that $t$ and $q$ hash to different values in one
projection is $2 \int_0^{D} (x/D)\, \eta(x)\, dx = \frac{\sqrt{2/\pi}}{D} (1 - e^{-D^2/2})$. □
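The closed form in this proof can be compared against direct numerical integration. The following is a small sketch, assuming $\eta$ is the standard normal density (that is, $t$ at distance exactly 1 from $q$); the function names are hypothetical.

```python
import math

def one_minus_g_closed_form(D):
    """(sqrt(2/pi) / D) * (1 - exp(-D^2 / 2))."""
    return math.sqrt(2.0 / math.pi) / D * (1.0 - math.exp(-D * D / 2.0))

def one_minus_g_numeric(D, steps=20000):
    """Midpoint-rule estimate of 2 * integral_0^D (x / D) * eta(x) dx."""
    dx = D / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * dx
        eta = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)  # N(0, 1) density
        total += (x / D) * eta
    return 2.0 * total * dx

D = 3.0
closed = one_minus_g_closed_form(D)
numeric = one_minus_g_numeric(D)
g = 1.0 - closed
print(closed, numeric, g)
```

For $D = 3$ both routes give $1 - g$ close to 0.263.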

Now since the function $r$ is implicit in the description
of the function $h$, $I(h(p) \mid q, h) \le I(h(p) - h(q) \mid r(q))$. So
by theorem ^ we have: 

Corollary 7 A $c$-approximate nearest neighbor in the
Euclidean space can be found in time $O(n^{\rho})$ using a data
structure of size $O(n)$, where

$$\rho = \frac{2a}{cD \log\left(1 / \left(1 - \frac{\sqrt{2/\pi}}{D}(1 - e^{-D^2/2})\right)\right)}.$$

Setting $D = 3$ gives the value of $\rho = 2a/(1.26c) \approx 2.06/c$.
Alternatively, using a data structure of size
$O(n^{1/(1-\rho)})$, we can perform the search operation in $O(d)$
time: again, if $d(t, q) > 1$, the upper bound of $g$ still
holds on the probability that a random point in $B(t, 1/c)$
hashes to the same value as $q$. This is because under the
random projections in use, for such a random point $r$,
$f(r) - f(q)$ has the same distribution as that of a point at
distance $\sqrt{d(t, q)^2 + 1/c^2}$ from $q$; clearly this
distance is greater than 1.



Remark 5 Although the converse of lemma ^ is not always
true - that is, it is not necessary that $2^{I(X)}$ random
samples are required to guess the value of a random
variable - it can be shown that for the specific hash
functions in consideration this is the case. That is, we need
$2^{(1 \pm o(1)) kM/\log(1/g)}$ random samples to guess the value
of $H(p)$ given $F(q)$. The essential idea is to consider
different values of $f(q)$ in small increments of $\epsilon/c$ and
argue that the number of projections for which $R(q)$ lies
in a small interval is close to the expected value with high
probability, and then argue that we need close to the
corresponding number of guesses for those sets of intervals.

6 Implementation Discussion 

We may assume that $d$ is at most $O(\log n)$, as for larger
$d$ we can use dimension reduction techniques that preserve
distances. Alternatively, we may use $O(\log n)$ locality
preserving hash functions to represent a point
in the database. So we need not store the entire
description of each point in the hash table but only its
$O(\log n)$ size hash value. This makes the size of each
hash entry small, especially if we know that there is a
unique $(r, cr)$-approximate nearest neighbor. More succinct
representations that use close to at most $2 \log n$
bits can be obtained by first embedding the points into
a high-dimensional Hamming metric and then reducing
the number of dimensions to about $O(\log n)$ by XORing
suitable sized random subsets of the bits (see lemma 1 
in PU). If the nearest neighbor is unique then in the 
final representation, each bit of the query and the nearest
neighbor will differ with probability at most $1/(2c)$,
whereas for other neighbors each bit position will differ
with probability at least $\frac{1}{2}(1 - 1/e)$. A simple and
tight probability calculation shows that for large enough
$c$, $2 \log n$ bits suffice with high probability to distinguish
the nearest neighbor from the other points. Note that
the hash key $H(p)$ need not be stored explicitly as it
suffices to hash this into an index for the hash array.
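The XOR-based reduction mentioned above can be illustrated with a short sketch. It is not the construction from the cited lemma: the helper names, the subset size and the bit counts are all hypothetical values chosen for the demonstration. Each output bit is the parity of a random subset of input bits, so a pair of strings that differ in few positions stays close after the reduction, while a pair that differs in many positions ends up differing in about half of the output bits.

```python
import random

def make_xor_sketch(num_bits, out_bits, subset_size, rng):
    """Pick out_bits random index subsets; each output bit is the
    XOR (parity) of the input bits in one subset."""
    subsets = [rng.sample(range(num_bits), subset_size) for _ in range(out_bits)]

    def sketch(bits):
        return [sum(bits[i] for i in subset) % 2 for subset in subsets]

    return sketch

def hamming(u, v):
    """Number of positions where two bit lists differ."""
    return sum(x != y for x, y in zip(u, v))

def flip(bits, count, rng):
    """Copy of bits with `count` randomly chosen positions flipped."""
    out = list(bits)
    for i in rng.sample(range(len(bits)), count):
        out[i] ^= 1
    return out

rng = random.Random(1)
sketch = make_xor_sketch(num_bits=4096, out_bits=256, subset_size=64, rng=rng)
x = [rng.randint(0, 1) for _ in range(4096)]
near = flip(x, 8, rng)     # a close point: 8 differing bits
far = flip(x, 200, rng)    # a far point: 200 differing bits
near_dist = hamming(sketch(x), sketch(near))
far_dist = hamming(sketch(x), sketch(far))
print(near_dist, far_dist)
```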

While we have included several polylog $n$ factors in the
space complexity, these are unlikely to be required in
practice. The first logn factor comes from Lemma El 
and is required only for arbitrary random variables. In
our case, since the entropy is obtained by adding several
different independent random variables, it is easy to show
that this is not required. The second $\log n$ comes from
the application of Markov's inequality on the entropy
distribution. This again can be eliminated by using, say,
Chebyshev or Chernoff bounds. The third one arises from
the crude application of Markov's inequality to ensure
that not too many far-off points are examined in each
hash table. Again we expect this will not really be
required in practice. So for a large enough constant $c$, if
we are searching for a unique $(r, cr)$ nearest neighbor,
the total amount of space required in practice is close to
$2n \log n$ bits. Even for $n$ equal to a million, this is
only 5MB, which is a tiny fraction of the main memory
space available on PCs.
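The 5MB figure follows directly from $2n \log n$ bits with $n = 10^6$. A short check, assuming base-2 logarithms and decimal megabytes (the function name is hypothetical):

```python
import math

def space_megabytes(n):
    """Size of 2 * n * log2(n) bits, expressed in megabytes (10^6 bytes)."""
    bits = 2 * n * math.log2(n)
    return bits / 8.0 / 10**6

mb = space_megabytes(10**6)
print(round(mb, 2))  # roughly 5
```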
7 Appendix

7.1 $I(h(p) \mid f(q))$ for Random Instance

We will show that $I(h(p) \mid f(q)) = I(h(q) \mid f(p))$ for the
random instance of nearest neighbor search in Euclidean
space presented in section 0J p is a random point 
distributed as $N^d(0, 1/\sqrt{2})$, and $q$ is distributed as
$N^d(p, 1/c)$. We will compute the probability density that
$f(q) = y$, and conditioned on this the probability that
$h(p) \ne h(q)$.

After the random projection, $f(q)$ is distributed
as $N(0, \sqrt{1/2 + 1/c^2})$. Also, the probability density
function of $f(p)$ conditioned on $f(q) = y$ is

$$\Pr[f(p) = x] \Pr[f(q) = y \mid f(p) = x] / \Pr[f(q) = y]
  = \eta(x\sqrt{2})\, \eta((x - y)c) / \eta(y/\sqrt{1/2 + 1/c^2}),$$

which is $\eta(c^2 y/(2 + c^2), 1/\sqrt{2 + c^2})$, the normal
distribution with mean $c^2 y/(2 + c^2)$ and deviation
$1/\sqrt{2 + c^2}$.

So given that $f(q) = y$, the probability that $h(p) \ne h(q)$
is $\Phi(\sqrt{2 + c^2}\, c^2 y/(2 + c^2)) = \Phi(cy/\sqrt{1 + 2/c^2})$. Since
the probability density function of $f(q)$ is also given by
$\eta(\sqrt{2}\, y/\sqrt{1 + 2/c^2})$, this results in the same calculation
as for $I(h(p) \mid f(q))$ except that the variables are scaled by
a factor of $\sqrt{1 + 2/c^2}$. So, $I(h(q) \mid f(p)) = I(h(p) \mid f(q))$.

7.2 Thick Hyperplanes 

Let us consider the case when $d$ is very large, say at least
$n \log n$. In that case we choose special hyperplanes that
better separate the set of points. The hyperplanes are
obtained as follows.

If $v_1, \ldots, v_n$ denote the points of the database, choose
$a_i$ randomly to be either $+1$ or $-1$ and set $h = \sum_i a_i v_i$.
Observe that $h$ is a random variable from $N^d(0, \sqrt{d})$.
So if $p$ is a random point in $B(q, r)$ then $h \cdot p - h \cdot q$ is
distributed as $N(0, r)$. We will show that $h$ separates the set
of points well. Indeed, look at
$h \cdot v_i = a_i |v_i|^2 + \sum_{j \ne i} a_j v_i \cdot v_j$.
Note that $v_i \cdot v_j$ is very small, distributed as $N(0, 1/\sqrt{d})$.
So the sum is distributed as $N(0, \sqrt{n/d})$. $|v_i|^2$ is
concentrated around 1 and is at least $1 - \epsilon$ with high
probability (at least $1 - \exp(-O(d))$). Also the sum term
is at most $\epsilon$ with probability at least $1 - \exp(-O(\log n))$.
So with high probability over the $\log n$ projections, for
all such $h$, $|h \cdot v_i| > (1 - \epsilon)$. Now, since the probability
that $h \cdot q \ne h \cdot p$ is clearly at most $\exp(-O(c^2))$, we have
that $I(h(q) \mid f(p))$ is $O(c^2 \exp(-O(c^2)))$.

This argument can also be applied if $d$ is as small as
$n$, but deriving the appropriate hyperplanes may require
solving a system of equations. If $a$ is the column vector
with entries $a_i$, then $h$ is obtained by solving $Ah = a$,
where the rows of $A$ are the point vectors $v_i$.
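The signed-sum construction for very large $d$ can be tried out on synthetic data. The sketch below is only an illustration: the database points are assumed to be random with independent $N(0, 1/\sqrt{d})$ coordinates (so that $|v_i|^2$ concentrates near 1), the sizes $n$ and $d$ are hypothetical, and it covers the simple case $h = \sum_i a_i v_i$ rather than the linear system $Ah = a$. It checks that every $h \cdot v_i$ lands on the side chosen by $a_i$ with a margin close to 1.

```python
import math
import random

def random_point(d, rng):
    """Point with i.i.d. N(0, 1/sqrt(d)) coordinates; squared norm is near 1."""
    s = 1.0 / math.sqrt(d)
    return [rng.gauss(0.0, s) for _ in range(d)]

def signed_sum_hyperplane(points, signs):
    """h = sum_i a_i * v_i for signs a_i in {+1, -1}."""
    d = len(points[0])
    h = [0.0] * d
    for a, v in zip(signs, points):
        for j in range(d):
            h[j] += a * v[j]
    return h

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

rng = random.Random(0)
n, d = 20, 4000                      # hypothetical sizes, d well above n log n
points = [random_point(d, rng) for _ in range(n)]
signs = [rng.choice((-1, 1)) for _ in range(n)]
h = signed_sum_hyperplane(points, signs)
# a_i * (h . v_i) = |v_i|^2 + a_i * (cross terms); expected to be near 1.
margins = [a * dot(h, v) for a, v in zip(signs, points)]
print(min(margins), max(margins))
```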



